In Module 3, we used TEI to mark up primary documents. Melodee Beals has been using TEI to mark up newspaper articles, creating the Colonial Newspapers Database (which she shared on GitHub). We then used GitHub Pages and an XSLT stylesheet to convert that database into a table of comma-separated values: https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv. We are now going to topic model the text of those newspaper articles, to see what patterns of discourse may lie within.

Getting Started

First we need to set up our workspace. On my machine, I’ve created a folder on my desktop called ‘beals-new’ to keep all of this neat and tidy.

setwd("desktop/beals-new")
# give yourself as much memory as you've got
options(java.parameters = "-Xmx5120m")
#install.packages('rJava')
library(rJava)
## from http://cran.r-project.org/web/packages/mallet/mallet.pdf
#install.packages('mallet')
library(mallet)

The first line sets our working directory to that new folder. The next line increases the memory available to Java (since our topic modeling tool, MALLET, is written in Java). Then we tell RStudio to load the rJava and mallet libraries. If you were running the code above for the first time, you would remove the # in front of each ‘install.packages’ command. Copy the code above into a new R script document in RStudio. Put the cursor in the first line and hit ctrl+enter to run the line through the console.

Getting the data

Now we want to tell RStudio to grab our data from our GitHub page. The thing is, RStudio can easily grab materials from websites where the URL is http; but when it is https (as it is with GitHub), things get a bit fussier. So we use a special package to grab the data, and then shove it into a variable that we can tease apart for our analysis.

library(RCurl)
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:rJava':
## 
##     clone
x <- getURL("https://raw.githubusercontent.com/shawngraham/exercise/gh-pages/CND.csv")

documents <- read.csv(text = x, col.names=c("Article_ID", "Newspaper Title", "Newspaper City", "Newspaper Province", "Newspaper Country", "Year", "Month", "Day", "Article Type", "Text", "Keywords"),
                        colClasses=rep("character", 3), sep=",", quote="")

Remember how we used ‘curl’ in one of our scripts in Module 3 to grab data from the Canadiana.org API? RCurl is its doppelganger within R. So, we create a variable called ‘x’ and pass it the Colonial Newspaper Database as csv. Then we read the contents of that csv file and tell R what the structure of the data is. All of this gets passed to a variable called ‘documents’. Incidentally, if we then wanted to look only at the keywords in this table, we would do that by querying ‘documents$Keywords’. The $ is quite handy!
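For instance, here are a few quick ways to poke at the data frame with $ (a sketch; the column names follow from the col.names we set above, with spaces converted to dots by read.csv):

```r
str(documents)                    # column names and types at a glance
head(documents$Keywords)          # first few entries in the Keywords column
unique(documents$Newspaper.City)  # note: 'Newspaper City' becomes 'Newspaper.City'
```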

Now, let’s take a look at our data.

counts <- table(documents$Newspaper.City)
barplot(counts, main="Cities", xlab="City", ylab="Number of Articles")

(plot: bar plot of the number of articles per city)

Clearly, we’re getting an Edinburgh/Glasgow perspective on things. And somewhere in our data, there’s a misspelled ‘Edinbugh’.
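That misspelling is easy enough to patch in place before going further (a sketch; it assumes the stray value is spelled exactly ‘Edinbugh’):

```r
# replace the misspelled city name with the correct one
documents$Newspaper.City[documents$Newspaper.City == "Edinbugh"] <- "Edinburgh"
# re-tabulate to confirm the fix
table(documents$Newspaper.City)
```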

years <- table(documents$Year)
barplot(years, main="Publication Year", xlab="Year", ylab="Number of Articles")

(plot: bar plot of the number of articles per publication year)

There’s a lot of material in 1789, another peak around 1819, and again in the late 1830s. We can ask ourselves now: is this an artefact of the data, or of our collection methods? This would be a question a reviewer would want answered. Let’s assume for now that these two plots are ‘true’ - that, for whatever reasons, only Edinburgh and Glasgow were concerned with these colonial reports, and that they were particularly interested during those three periods. This is already an interesting question that we as historians would want to explore. This quick distant look now necessitates some close reading - and back again!

A distant read

Let’s look at the words themselves in our data (and, as an aside, you might read this by David Mimno)

mallet.instances <- mallet.import(documents$Article_ID,
                                  documents$Text,
                                  "/Users/shawngraham/Desktop/data mining and tools/TextAnalysisWithR/data/stoplist.csv",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")

The block above passes the article IDs and the text of our newspaper articles to the MALLET routine. Note also the path to my stoplist. That’s a generic one I use all the time when I’m first playing with topic models. You might want to create your own for each project, given its particulars. Here’s a stoplist you can use: http://bridge.library.wisc.edu/jockersStopList.txt. Download that and save it into your working directory. Then, in the block above where I’ve got the location of my stoplist, just put “jockersStopList.txt” and R will look for it in your working directory. Note that Jockers compiled this stoplist for his research in the literary history of the nineteenth century. Your mileage may vary! Finally, the last bit after ‘token.regexp’ applies a regular expression to our newspaper articles, keeping only tokens that begin and end with a letter (allowing letters or punctuation in between), which cleans them up.

#set the number of desired topics
num.topics <- 50
topic.model <- MalletLDA(num.topics)

Now we’ve told MALLET how many topics to search for; this is a number you’d want to fiddle with, to find the ‘best’ number of topics. The next line creates a variable ‘topic.model’ which will eventually be filled by MALLET using the LDA approach, for 50 topics. Let’s get some info on our topic model, on our distribution of words in these newspaper articles.
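One rough way to do that fiddling is simply to build quick models at a few different topic counts and eyeball the resulting top words (a sketch only, not a formal model-selection method; the iteration count is kept low just for a fast look):

```r
# build quick-and-dirty models at several topic counts and compare by eye
for (k in c(20, 50, 100)) {
  tm <- MalletLDA(k)
  tm$loadDocuments(mallet.instances)
  tm$train(200)  # fewer iterations than a 'real' run, just for a quick look
  tw <- mallet.topic.words(tm, smoothed=TRUE, normalized=TRUE)
  cat("---", k, "topics, top words of the first topic ---\n")
  print(mallet.top.words(tm, tw[1,], num.top.words=5))
}
```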

topic.model$loadDocuments(mallet.instances)
## Get the vocabulary, and some statistics about word frequencies.
## These may be useful in further curating the stopword list.
vocabulary <- topic.model$getVocabulary()
word.freqs <- mallet.word.freqs(topic.model)
head(word.freqs)
##            words term.freq doc.freq
## 1       commerce         3        3
## 2 canada.extract         1        1
## 3         letter       122       83
## 4     population        69       41
## 5         canada       137       48
## 6       reckoned         1        1
write.csv(word.freqs, "cnd-word-freqs.csv" )

As the comment indicates, if we take a look at word frequencies right now, before we generate the model, we can see if there are words that would be better added to our stopword list. The final line writes these word frequencies to a new csv file. Do you see how you might create a bar plot of word frequencies?
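One possible answer to that question (a sketch; it sorts the word.freqs table by term frequency and plots the twenty most common words):

```r
# sort the word frequencies in descending order of term frequency
sorted <- word.freqs[order(-word.freqs$term.freq), ]
# bar plot of the twenty most frequent words, labels rotated for legibility
barplot(sorted$term.freq[1:20],
        names.arg=as.character(sorted$words[1:20]),
        las=2, main="Top 20 words", ylab="Term frequency")
```

Words that dominate this plot but carry little meaning for your question are good candidates for the stoplist.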

Now we do the heavy lifting: generating a topic model:

## Optimize hyperparameters every 20 iterations,
## after 50 burn-in iterations.
topic.model$setAlphaOptimization(20, 50)
## Now train a model. Note that hyperparameter optimization is on, by default.
## We can specify the number of iterations. Here we'll use a large-ish round number.
topic.model$train(1000)
## Run through a few iterations where we pick the best topic for each token,
## rather than sampling from the posterior distribution.
topic.model$maximize(10)
## Get the probability of topics in documents and the probability of words in topics.
## By default, these functions return raw word counts. Here we want probabilities,
## so we normalize, and add "smoothing" so that nothing has exactly 0 probability.
doc.topics <- mallet.doc.topics(topic.model, smoothed=T, normalized=T)
topic.words <- mallet.topic.words(topic.model, smoothed=T, normalized=T)

Let’s look at some of our topics.

## What are the top words in topic 7?
## Notice that R indexes from 1, so this will be the topic that mallet called topic 6.
mallet.top.words(topic.model, topic.words[7,])
##     words weights
## 1    wool 0.03308
## 2   bales 0.01854
## 3  market 0.01722
## 4   wools 0.01722
## 5   sales 0.01589
## 6   south 0.01325
## 7  prices 0.01325
## 8   wales 0.01193
## 9  german 0.01193
## 10    van 0.01193

Now we’ll write the distribution of the topics by document (i.e. newspaper article) to a csv file that we can explore/visualize with other tools. Then we’ll take a look at the key words describing each topic.

topic.docs <- t(doc.topics)
topic.docs <- topic.docs / rowSums(topic.docs)
write.csv(topic.docs, "cnd-topics-docs.csv" ) 
## Get a vector containing short names for the topics
topics.labels <- rep("", num.topics)
for (topic in 1:num.topics) topics.labels[topic] <- paste(mallet.top.words(topic.model, topic.words[topic,], num.top.words=5)$words, collapse=" ")
# have a look at keywords for each topic
topics.labels
##  [1] "bill house hon hear committee"                       
##  [2] "fowler chiefs island anna maria"                     
##  [3] "kingston niagara americans fleet lake"               
##  [4] "indians killed river number indian"                  
##  [5] "gold bank coin silver price"                         
##  [6] "whale boats south work sea"                          
##  [7] "wool bales market wools sales"                       
##  [8] "government cape settlers number persons"             
##  [9] "company esq colonial paid directors"                 
## [10] "church meeting commission members rev"               
## [11] "town fire man house evening"                         
## [12] "negroes people negro colour arms"                    
## [13] "ship board number captain vessel"                    
## [14] "land lord john meeting favour"                       
## [15] "south wales emigration colony colonies"              
## [16] "country great made time state"                       
## [17] "convicts society punishment transportation prisoners"
## [18] "feet horses children white wife"                     
## [19] "free slaves colonial governor island"                
## [20] "british press french india europe"                   
## [21] "south spanish coast king buenos"                     
## [22] "petition pay paper committee resolved"               
## [23] "river governor miles bathurst western"               
## [24] "men community degree emigration manners"             
## [25] "ditto june april sept john"                          
## [26] "good make family find articles"                      
## [27] "land south miles north timber"                       
## [28] "paisley council edinburgh hamilton subscribed"       
## [29] "years population increase distress currency"         
## [30] "wild woodland mild cheer heart"                      
## [31] "quakers cherokee chiefs thought world"               
## [32] "canada number emigrants emigration persons"          
## [33] "canada upper province british subjects"              
## [34] "labour colony situation society produce"             
## [35] "men general army troops fort"                        
## [36] "states united vessels british colonies"              
## [37] "boat boats fired ball barrels"                       
## [38] "peru town lima produce people"                       
## [39] "king chief natives people crew"                      
## [40] "states united america land york"                     
## [41] "duke death military wellington grace"                
## [42] "edinburgh hospital gentlemen court medical"          
## [43] "west company north river fort"                       
## [44] "arrived received letter left days"                   
## [45] "years person living age correspondent"               
## [46] "peace law laws proper citizens"                      
## [47] "american war british america britain"                
## [48] "james ireland query emigrants irishmen"              
## [49] "president full nootka washington columbia"           
## [50] "author fair dramatic maraino consistent"
write.csv(topics.labels, "cnd-topics-labels.csv")

Some interesting patterns suggest themselves already! But a list of words doesn’t capture the relative importance of particular words in particular topics. A word might appear in more than one topic, for instance, but really dominate one rather than the other. Word clouds, that much-maligned technique, are actually rather useful in this regard. Let’s visualize these topic words.

### do word clouds of the topics
library(wordcloud)
## Loading required package: RColorBrewer
for(i in 1:num.topics){
  topic.top.words <- mallet.top.words(topic.model,
                                      topic.words[i,], 25)
  print(wordcloud(topic.top.words$words,
                  topic.top.words$weights,
                  c(4,.8), rot.per=0,
                  random.order=F))
}

(plots: one word cloud per topic, fifty in all. Depending on the size of your plot window, you may see warnings that some long words ‘could not be fit on page’, as well as encoding warnings for tokens containing an en-dash, such as ‘resolved–that’.)

Note that when you run all this in RStudio, your plots will appear in the ‘plots’ window in the bottom-right quadrant of the interface, and you can export them as png or pdf. If you export as pdf, you can then open them in a vector graphics program like Inkscape or Illustrator and modify the individual elements to make a more pleasing visual.
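You can also save the clouds programmatically rather than clicking through the export dialogue (a sketch using R’s built-in pdf device; the filename is arbitrary):

```r
# open a multi-page pdf device; each word cloud becomes one page
pdf("cnd-wordclouds.pdf")
for (i in 1:num.topics) {
  topic.top.words <- mallet.top.words(topic.model, topic.words[i,], 25)
  wordcloud(topic.top.words$words, topic.top.words$weights,
            c(4,.8), rot.per=0, random.order=FALSE)
}
dev.off()  # close the device and write the file to the working directory
```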

Finally, let’s do some clustering. We can cluster topics based on how they share words in our corpus:

## cluster based on shared words
plot(hclust(dist(topic.words)), labels=topics.labels)

(plot: dendrogram of topics clustered by shared words)

If we want to get really fancy, we can make a network visualization of how topics interlink due to their distribution in documents.

topic_docs <- data.frame(topic.docs)
names(topic_docs) <- documents$Article_ID

#install.packages("cluster")
library(cluster)
topic_df_dist <- as.matrix(daisy(t(topic_docs), metric = "euclidean", stand = TRUE))
# Change row values to zero if less than row minimum plus row standard deviation
# keep only closely related documents and avoid a dense spaghetti diagram
# that's difficult to interpret (hat-tip: http://stackoverflow.com/a/16047196/1036500)
topic_df_dist[ sweep(topic_df_dist, 1, (apply(topic_df_dist,1,min) + apply(topic_df_dist,1,sd) )) > 0 ] <- 0

#install.packages("igraph")
library(igraph)
g <- as.undirected(graph.adjacency(topic_df_dist))
layout1 <- layout.fruchterman.reingold(g, niter=500)
plot(g, layout=layout1, edge.curved = TRUE, vertex.size = 1, vertex.color= "grey", edge.arrow.size = 0, vertex.label.dist=0.5, vertex.label = NA)

(plot: network diagram of topics linked by their distribution across documents)

write.graph(g, file="cnd.graphml", format="graphml")

There are many ways of visualizing and transforming our data; this document captures only a small fraction of the kinds of things you could do. Another good exploration is at http://bridge.library.wisc.edu/hw1a-Rcoding-Jockers.html. Ben Marwick does really fun things with the Day of Archaeology blog posts (https://github.com/benmarwick/dayofarchaeology), and indeed some of the code above comes from Marwick’s explorations. Keep your R scripts in your open notebook, and somebody might come along and use them, cite them, improve them, share them! Keep all your data, too. Here’s an example from my own work: https://github.com/shawngraham/ferguson.